
Network Processors. Evaluating Architectures for Leading Edge Applications

Original: Network Processors Architecture, Programming, and Implementation.
Author: Chris Rosewarne

Abstract

The need for higher-layer processing in network equipment, to implement flexible devices such as Load Balancers and Web Switches and to perform deep packet processing for security applications, has led to the creation of the Network Processor. These devices must be capable of processing packet data at wire speed, which creates challenges not previously seen in processor design. Many varied and innovative approaches have been taken to overcome these obstacles. This whitepaper first discusses the applications for Network Processors, elaborating these into some fundamental requirements of such devices. Second, the trade-offs in Network Processor design are discussed. Finally, a number of architectural approaches are described in the context of how well they perform in packet processing applications and the limitations of each. The implications for the adopted programming model are also discussed.

Introduction

In recent years, there has been a push to create a generic packet processor for use in network equipment. This has arisen from the demand to implement new functionality without expending a large effort redesigning equipment. The aim is to process a stream of packets at wire speed for various applications in the network, and this has led to the creation of the Network Processor market. The fundamental premise of a Network Processor is to provide the flexibility of a programmable device with the performance of a fixed-function hardware device, shortening the Time-to-Market for delivering new functionality. Owing to the very open-ended nature of this problem space, the solution space has been equally diverse. This has led to numerous start-ups, in addition to blue-chip companies such as Intel and IBM, introducing their own architectures. At the peak of the internet boom many companies were working on these devices. In the intervening years most have left this market, leaving a small number with real silicon to deploy. After the initial hype surrounding these devices, and the subsequent low, the Network Processor market is now resurging as vendors try to deliver on the promise that these devices hold. Over the second half of 2002 and the first half of 2003, Network Processor revenue was estimated at just US$61M; however, the forecast for 2006 is US$190M [1].

Depending on where the Network Processor is located, in Core, Edge or Access equipment, its throughput requirements will vary drastically. In addition, the types of application they are used for differ greatly. At the Core, Network Processors are mainly used in Routers. In Edge equipment, the applications include Load Balancers. In Access equipment, the application may be a Wireless Basestation or DSLAM. Network Processors may also be used to enable advanced statistics collection and billing applications, in any equipment category. For all these examples, the Network Processor enables the basic function of the device to utilise data at higher layers, allowing it to operate more intelligently.

In these applications, the Network Processor sits in the data path and therefore needs to process packets at wire-speed without loss. In recent years, network data rates have increased at a phenomenal pace, far outstripping Processor rate increases, resulting in conventional processor architectures becoming even less suitable for these applications.

There are many options available to realise the functionality required of a Network Processor. At one end of the scale, a custom Application Specific Integrated Circuit (ASIC) could be designed around the end application. This gives the maximum performance, owing to the freedom during design to optimise for the specific task being performed. However, this provides only a fixed-function solution, so any change in requirements, such as support for a new protocol, would mandate a silicon re-spin. At the other end of the scale, General Purpose Processor (GPP) cores could be used to perform the packet processing, providing maximum flexibility but insufficient performance for most applications. Instead, an intermediate approach is required, such as more advanced techniques for using multiple existing cores, or the creation of new cores optimised around the functions typically required for packet processing. Alternatively, a standard processor core may be augmented with co-processors designed for specific tasks, taking load off the main processor. In practice, a combination of approaches will yield acceptable performance for a given application. In creating these architectures, fundamental assumptions about the data flow in a processor need to be revisited to meet the requirement of wire-speed packet processing.

Network Processors in the Network

Applications for Network Processors

The various applications for a Network Processor require processing of packets at different layers of the Open System Interconnection (OSI) protocol stack. Asynchronous Transfer Mode (ATM) protocol support is currently less relevant to Network Processors, due to the ubiquity of TCP/IP, although ATM may still be used to transport TCP/IP, for instance in DSL applications.

One application for a Network Processor at the Core is an IP Router. In this instance, it is necessary to inspect the packet header destination IP address field and perform a table look-up to determine which output port the packet should be directed to. This is a Layer 3 function, characterised by very high packet rates but no inter-packet dependency, which is often described as ‘Data Plane’ processing. Functions which involve unravelling the protocol transactions at higher layers are referred to as ‘Control Plane’ processing. In a Router, the Control Plane processing required involves implementing the protocols to update Route tables.

Consider a packet carrying an HTTP GET request. By this stage the Uniform Resource Locator (URL) has already been resolved to an IP address and a connection established with the Web Server. Note that the Options portion of the IP packet is variable length, meaning that the offset of the HTTP GET is not fixed relative to the start of the IP packet. A Router will normally operate on only the IP packet header.

Each Router decrements the Time-To-Live (TTL) field and updates the packet’s checksum field. Once the TTL field reaches zero the packet is discarded, preventing packets from looping infinitely.
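As a rough illustration, the per-hop TTL update need not recompute the entire header checksum: the incremental update rule of RFC 1624 patches the checksum using only the 16-bit header word that changed. The sketch below is simplified (the TTL/Protocol word is handled in host byte order, and the function names are illustrative, not from any vendor API):

```c
#include <stdint.h>

/* Incrementally update a 16-bit one's-complement checksum when a
 * header word changes from 'old_word' to 'new_word' (RFC 1624):
 * HC' = ~(~HC + ~m + m'). */
static uint16_t csum_update(uint16_t csum, uint16_t old_word, uint16_t new_word)
{
    uint32_t sum = (uint16_t)~csum;
    sum += (uint16_t)~old_word;
    sum += new_word;
    sum = (sum & 0xFFFF) + (sum >> 16);   /* fold carries back in */
    sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Per-hop TTL processing: decrement the TTL and patch the checksum.
 * The changed 16-bit word holds TTL (high byte) and Protocol (low
 * byte). Returns 1 if the packet may be forwarded, 0 if the TTL is
 * exhausted and the packet must be discarded. */
int process_ttl(uint8_t *ttl, uint16_t *csum, uint8_t proto)
{
    if (*ttl <= 1)
        return 0;                          /* expired: discard */
    uint16_t old_word = (uint16_t)((*ttl << 8) | proto);
    *ttl -= 1;
    uint16_t new_word = (uint16_t)((*ttl << 8) | proto);
    *csum = csum_update(*csum, old_word, new_word);
    return 1;
}
```

Because only one word changes, the update costs a handful of additions rather than a pass over the whole header, which matters at wire speed.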

Currently IP Version 4 (IPv4) is the dominant version in use, with a 32-bit IP address field. Due to the way IP addresses are allocated, there is a shortage of available addresses for new use. A newer protocol, IPv6, promises to solve this problem with 128-bit IP addresses. The implication for IPv6 Routers is much wider Route Tables.

To avoid each Route Table having an entry for every single IP address it might encounter, a method known as ‘Longest Prefix Matching’ is used, whereby certain bits of the IP address can be specified as “don’t cares” in the routing table. This allows a single routing entry to be used for multiple IP addresses.
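The matching rule can be illustrated with a naive linear scan over a small table. Real Routers use tries or TCAMs for speed; the table layout and names below are purely illustrative:

```c
#include <stdint.h>

struct route { uint32_t prefix; uint8_t len; int port; };

/* Longest Prefix Match: among all entries whose masked prefix bits
 * match the destination address, return the port of the entry with
 * the longest prefix. A /0 entry acts as the default route. */
int lpm_lookup(const struct route *tbl, int n, uint32_t dst)
{
    int best_port = -1, best_len = -1;
    for (int i = 0; i < n; i++) {
        /* bits beyond the prefix length are "don't cares" */
        uint32_t mask = tbl[i].len ? 0xFFFFFFFFu << (32 - tbl[i].len) : 0;
        if ((dst & mask) == tbl[i].prefix && tbl[i].len > best_len) {
            best_len = tbl[i].len;
            best_port = tbl[i].port;
        }
    }
    return best_port;   /* -1 means no matching route */
}
```

With entries for 10.0.0.0/8 and 10.1.0.0/16, a destination of 10.1.2.3 matches both, and the /16 entry wins because it is more specific.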

Another approach taken to routing packets is the use of Multi Protocol Label Switching (MPLS) Networks. At the boundary of the MPLS Network, a Label Edge Router (LER) attaches a label to each packet, which is removed as packets leave the MPLS Network. Within the MPLS Network the label defines a path or set of paths that the packet may take [5]. This allows efficient forwarding of packets without requiring a very large route table, as packets from many different IP addresses may share the same label. This method is suited to high-speed implementation in Core networks. In this case a Network Processor performs the label insertion and removal at the MPLS Network boundaries [6].

Websites with very high traffic often use multiple servers to split the load between them. A Network Processor can be placed at the input to distribute traffic across the servers, acting as a Load Balancer. There are several ways of distributing the traffic. If there is no requirement that a particular user always reaches the same server, distribution can be made purely on load. If a given user must always reach the same server so that session state is retained, the Network Processor could track each TCP session and assign a server during setup. Alternatively, a hash of the source IP address and port number can be used to select the server; this also results in a given session always being directed to the same server.
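The hash-based selection can be sketched as follows. The mixing function here is an arbitrary illustrative choice (a multiplicative hash), not any particular vendor's algorithm; what matters is that the same flow always yields the same server index:

```c
#include <stdint.h>

/* Pick a server for a flow by hashing the source IP address and
 * source port, so that every packet of a given session lands on the
 * same server without per-session state in the load balancer. */
unsigned select_server(uint32_t src_ip, uint16_t src_port, unsigned n_servers)
{
    uint32_t key = src_ip ^ ((uint32_t)src_port << 16) ^ src_port;
    key *= 2654435761u;               /* Knuth's multiplicative constant */
    return (key >> 16) % n_servers;   /* use the better-mixed high bits */
}
```

The trade-off against per-session tracking is that the mapping changes if the server pool is resized, which disturbs sessions in flight.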

The EZchip NP-1c provides an example of use as a Load Balancer or Content Switch [7]. Here, the HTTP GET method is intercepted and the URL located by searching for the '/' character in the TCP segment. Refer to Figure 4 for an example packet. The URL is then matched against the search memory to determine what action to take, such as forwarding to a particular port.

In a QoS provisioning role, the Network Processor can be used to enforce policy. A corporate user will have a Service Level Agreement (SLA) with their ISP, and the ISP must set its policy to meet this SLA [8]. The first task is recording statistics on the incoming stream flows; the second is discarding packets that do not fit the specified criteria. This provides a way to guarantee performance levels for each customer.

Network Processors may also be used as Firewalls. Their ability to look into the packet lets them reject packets that contain certain content, enabling smarter firewalls for applications such as spam filtering, virus detection and Intrusion Detection Systems. By monitoring traffic patterns, Denial-of-Service (DoS) attacks may be detected. Once detected, the packets involved may be dropped, preventing overload of the server being targeted by the attack.
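The URL-locating step described above for the NP-1c can be expressed in software terms as below. This is a simplified illustration only: it assumes the entire request line arrives in a single TCP segment, and the function name is hypothetical, not an NP-1c API:

```c
#include <string.h>
#include <stddef.h>

/* Locate the URL path inside the TCP payload of an HTTP GET by
 * finding the '/' that starts it and the space that ends it.
 * Returns a pointer to the path and sets *url_len, or returns NULL
 * if the segment does not begin with a GET request line. */
const char *find_url(const char *payload, size_t len, size_t *url_len)
{
    if (len < 5 || memcmp(payload, "GET /", 5) != 0)
        return NULL;
    const char *url = payload + 4;                 /* the leading '/' */
    const char *end = memchr(url, ' ', len - 4);   /* before " HTTP/1.x" */
    if (!end)
        return NULL;
    *url_len = (size_t)(end - url);
    return url;
}
```

On the NP-1c this search runs in hardware against search memory; the point of the sketch is only that the URL's offset is not fixed and must be found by scanning.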

Performance Requirements of Network Processors

A Network Processor must be able to process and forward packets passing through at wire speed. It has no ability to stall incoming data if it needs to catch up, and so must be designed at the outset to handle worst-case conditions. At the lower layers (2 and 3) there is no concept of a connection, so processing limited to these layers need not maintain any state information on connections. This level of processing is the domain of Routers. In some cases a Network Processor used for this processing may examine higher-layer data to improve low-level decisions, as in Load Balancing or when acting as a Firewall. Even so, removing the need to update connection state information drastically eases the processing load. Processing at this layer is referred to as the 'Data Plane'. Where processing at higher layers is required, more complex protocol analysis must be performed, which requires storing state information on individual connections. This is categorised as 'Control Plane' processing. To monitor user sessions, the TCP 3-way handshake used to initiate a session must be captured by the Network Processor.

To store route tables or information on TCP connections, Look-up Tables are used. These must be searchable, enabling output ports or TCP connections to be associated with a given IP address and TCP port number. This must be performed for each incoming packet, under worst-case conditions such as shortest packet lengths or the greatest rate of session set-up and tear-down.

The latency of sending packets across a network is important for some applications, less so for others. Traditional uses of the Internet for data transfer place no stringent requirements on latency. Each packet is buffered before being switched in a Router, adding some delay. Also, each link carries a stream of variable-length packets, and depending on the traffic mix the latency will vary; this variation is known as packet jitter. Newer multimedia applications such as Voice over IP (VoIP) place greater constraints on network latency and jitter. Network latency results in pauses between each participant speaking in a conversation, which is perceived as an annoyance. Variations in latency result in late packets being dropped, due to the real-time nature of voice; the resulting clipping of a conversation is perceived as a significant degradation in quality. Therefore, as packets traverse the network, the latency added by each stage must be minimal. This implies that packets must be processed and forwarded promptly, and not buffered for excessive periods of time.

Underlying Technologies

The underlying technology used to transport data is fibre optics. This offers very high speeds, up to 40Gb/s, and requires Repeaters at fixed intervals. There are many different laser technologies available, offering different 'reaches', or distances before a Repeater or the receiver is required. It is also possible to multiplex different wavelengths over a single fibre, increasing the throughput. As data propagates over optical fibre, the main source of delay is in the Repeaters; the delay over the fibre itself is minimal. Even over very large distances, the delay introduced is quite small relative to the delay of processing packets in the network equipment. Silicon integrated circuits are much slower than optical fibre, so parallelism at the hardware level is required to reach wire speed. This introduces complexities into the design and architecture of network equipment.

Challenges in Network Processor design

The purely hardware solution to network processing is to create an ASIC that is specifically designed to perform a limited set of tasks. This requires a very large design effort and the high Non-Recurring Engineering (NRE) cost associated with ASIC design. Also, newer processes that offer the required speed have very high manufacturing costs, not to mention the cost of verifying such complex designs. These costs must be amortised over the relatively small set of applications that the ASIC is targeted at [9]. Technologies such as Field Programmable Gate Arrays (FPGAs) may be used to significantly reduce the NRE and mitigate the risk of design errors, due to the inherent reprogrammability these devices offer. Use of FPGA technology does incur a performance penalty: although modern FPGA families support very high clock rates, they are still slower than an ASIC. However, they have large amounts of available logic and often have other features such as high-speed I/O. Even so, any change to the functionality requires reworking the design at a very low level, requiring many man-months of effort.

At the opposite end of the spectrum, GPP devices may be employed for packet-level processing. These can easily implement very complex protocols, making them well suited to higher-layer processing, especially stateful processing of connections. The drawback is performance: a GPP is poorly suited to many common operations on packets, generally in the realm of bit-field manipulation, searching and hashing. Moving large amounts of data around is extremely difficult due to the bus bottleneck, even when Direct Memory Access (DMA) is employed, as one bus is shared between instructions and data. Currently available Network Processors use a hybrid of these two approaches to realise their functionality.

Storage of route tables or other information required by the Network Processor generally requires a large amount of off-chip memory. As with conventional processor architectures, memory technologies offering higher capacity also have longer access times. Synchronous Dynamic RAM (SDRAM) can provide very large amounts of memory, but with a relatively long latency. Access is improved using Double Data Rate (DDR) techniques, but it still lags behind Static RAM (SRAM) for access speed. A common approach is a memory hierarchy, where each data structure is stored in memory appropriate to its access requirements. A relatively small on-chip memory provides very high speed access.

Search Table memory often uses Content Addressable Memory (CAM). A CAM returns the address at which a value of data was found, searching all locations simultaneously. This provides a searchable memory, which greatly speeds up table look-up operations, and is commonly used in IP Routers. CAM is very expensive and power hungry, but well suited to searching a large table extremely quickly. With such a high cost associated with CAM, its use needs to be minimised in the system architecture.

Packet processing operating on Layers 2 and 3 may have no inter-packet dependency. For each packet received, a particular set of operations is executed regardless of any other packets in the stream. This may still be the case when the processing involves looking into the packet payload (and hence the higher layers), as in a URL Load Balancer. The elimination of inter-packet dependency has many consequences for the Network Processor architecture. If a CAM were used to associate IP addresses and port numbers with connections, tracking a large number of connections being set up and torn down would create very high bandwidth just to keep the memory up to date. If there is no need to maintain state information on connections, the bandwidth to the search memory is reduced. Eliminating inter-packet dependency also enables a greater degree of parallelism in packet processing. If two processors received packets from the same stateful stream, one would have to stall while the other processed the first packet, or the state information would need to be accessible to both processors. A simpler scheme dispatches all packets of a flow in order to a single processor, though this requires a more complex packet dispatcher capable of classification. With no inter-packet dependency, the need to direct flows of packets to a particular processor to ensure in-order processing is eliminated.

At the core of the Network Processor, packet data is transported on a bus of fixed width and speed. The shortest packet size used in IP is 40 bytes. When transported over Packet-over-SONET (POS), header insertion adds 8 bytes. At 10Gb/s rates this translates into approximately 26 million packets per second, or one every 38 nanoseconds. To increase the bandwidth of data transfer, the bus width can be increased. This creates difficulties, as packets that do not fit neatly into the bus width result in large amounts of unused bus capacity.
For example, a 40-byte IP packet on a 256-bit bus takes two clock cycles to transfer. The first clock cycle transfers 32 bytes (the maximum) and the second only 8 bytes, leaving 24 bytes of unused bus capacity in that cycle. This gives a bus utilisation of 62.5%. To maintain the maximum packet rate, the clock rate must be increased to accommodate the worst-case unused-bandwidth scenario. Additionally, wider busses increase the routing congestion inside the chip. Where the busses must interface to external memories, the chip pin-count is greatly increased, as is the board-level routing congestion.

CAMs are very expensive and use a lot of power, but there are alternative ways to implement search functions. One such method is the use of hash algorithms, which reduce look-up table size. These provide a means to map a large set of entries onto a smaller set of entries [10]. This implies a many-to-one relationship and hence the possibility of collisions. A hash table may be implemented in RAM, resulting in significant cost savings, but schemes to work around collisions require multiple accesses to the memory, so the search time is not deterministic. Supporting Longest Prefix Matching, required in any Router, adds even more complexity to this scheme.
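A minimal sketch of such a RAM-based hash table, using chaining to resolve the collisions a many-to-one hash inevitably produces, is shown below. The bucket count and mixing function are arbitrary illustrative choices; note that the lookup walks a chain, so its cost is not deterministic the way a CAM access is:

```c
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 256

struct entry { uint32_t key; int value; struct entry *next; };
static struct entry *buckets[NBUCKETS];

/* Multiplicative hash mapping a 32-bit key to a bucket index. */
static unsigned hash32(uint32_t key)
{
    key *= 2654435761u;
    return (key >> 24) % NBUCKETS;
}

/* Insert an exact-match entry; collisions are pushed onto a chain. */
void table_insert(uint32_t key, int value)
{
    struct entry *e = malloc(sizeof *e);
    e->key = key;
    e->value = value;
    e->next = buckets[hash32(key)];
    buckets[hash32(key)] = e;
}

/* Exact-match lookup: walk the chain until the key is found.
 * Returns 1 on a hit (with *value set), 0 on a miss. */
int table_lookup(uint32_t key, int *value)
{
    for (struct entry *e = buckets[hash32(key)]; e; e = e->next)
        if (e->key == key) { *value = e->value; return 1; }
    return 0;
}
```

This handles exact matches only; extending it to Longest Prefix Matching requires probing at multiple prefix lengths, which is exactly the added complexity noted above.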

Architectures of Network Processors

This section discusses a number of processor design techniques in the context of how well they address the challenges of Network Processor applications. Some of these techniques are borrowed from conventional high performance GPPs, whereas others are targeted at packet processing applications. These distinct approaches to Network Processor design are used either singly or in combination by real designs. As the approaches are discussed, examples of currently available chips are provided to elaborate the descriptions.

Memory Hierarchy

In a conventional GPP design, there exists a memory hierarchy where very fast memory is tightly coupled to the GPP to augment slower, larger system memory. The fast local memory caches data from the slower memory, and virtual memory using disk space extends this concept further. This model works well when data accesses exhibit high locality, meaning that subsequent accesses tend to be at nearby locations (spatial locality) and occur close together in time (temporal locality). In a Network Processor, this may be true of the instruction stream, but it is not true of the data stream: there is a continual flow of packets arriving and leaving, so no locality is present. This requires that the cache be bypassed and replaced with a means of delivering packets directly to the processor as quickly as possible.

Packet Data Path

From the processor's point of view, for full-duplex processing it expects to see two buffers in memory: an Ingress buffer and an Egress buffer. The Ingress buffer contains packets received from the physical interface to be forwarded to the Switch Fabric interface. The Egress buffer contains packets received from the Switch Fabric to be forwarded to the physical interface. Although these could be updated using a DMA process, this would take bus cycles away from the processor. Alternatively, the buffers could be implemented using Dual Port RAM (DPRAM), which allows the processor to access packets in memory while other packets are simultaneously being written to or read from the memory. It is the job of the processor to perform the necessary manipulation of each packet before it is forwarded. Signalling between the processor and the packet receiving and forwarding engines is necessary to flag packets as ready-to-process and ready-to-send.
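The signalling described above might be modelled, purely as an illustrative sketch with hypothetical names, by a descriptor ring whose slots are flagged in turn by the receive engine, the processor and the forwarding engine:

```c
#include <stdint.h>

/* Each slot moves FREE -> READY_TO_PROCESS (receive engine) ->
 * READY_TO_SEND (processor) -> FREE (forwarding engine drains it). */
enum slot_state { SLOT_FREE, READY_TO_PROCESS, READY_TO_SEND };

#define RING_SLOTS 8

struct slot {
    enum slot_state state;
    uint16_t len;           /* packet length in bytes */
    uint8_t pkt[2048];      /* packet data in the (dual-port) buffer */
};

static struct slot ingress[RING_SLOTS];

/* Processor side: claim the next packet flagged by the receive
 * engine, or return NULL when there is nothing to process. */
struct slot *next_ready(void)
{
    for (int i = 0; i < RING_SLOTS; i++)
        if (ingress[i].state == READY_TO_PROCESS)
            return &ingress[i];
    return 0;
}

/* After header manipulation, hand the packet to the forwarding engine. */
void mark_sendable(struct slot *s)
{
    s->state = READY_TO_SEND;
}
```

In real hardware the flags live where both sides can see them, and a DPRAM lets the state transitions happen without stealing the processor's bus cycles.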

Conclusion

This whitepaper has discussed a number of applications for Network Processors, ranging across the Core, Edge and Access portions of the network. Depending on the level of protocol analysis involved, requirements for these devices have been extracted from the applications, and from these requirements a number of architectural considerations have been drawn. An elaboration on the various architectural approaches has been provided, including an evaluation of how each approach assists in packet processing. The ways these devices are programmed have also been discussed, taking the various architectures into consideration. Finally, emerging trends in the Network Processor market have been discussed, taking into account other new technologies that are becoming available. With such a wide range of architectures to choose from, Network Equipment vendors must carefully consider their application before selecting a device. As the range of architectures demonstrates, Network Processors cannot be treated as interchangeable black boxes. Knowledge of the various architectures will enable an informed decision regarding which device is best suited to the application.

References

[1] Wheeler, B., Gwennap, L. "A Guide to Network Processors, Fifth Edition". Linley Group, http://www.linleygroup.com/pdf/NPU_v5.pdf.

[2] Agilent Technologies, "RouterTester at your Service". Advanced Networks Division, http://advanced.comms.agilent.com/routertester/member/technology/edge/RouterTester-at-your-Service.pdf.

[3] Herity, D. "Network Processor Programming". Embedded.com, http://www.embedded.com/story/OEG20010730S0053.htm.

[4] Meyer, D. "University of Oregon Route Views Project". Advanced Network Technology Center web site, http://www.antc.uoregon.edu/route-views.

[5] Shah, N. "Understanding Network Processors". Department of Electrical Engineering and Computer Science, University of California, Berkeley, http://www.cs.berkeley.edu/~plishker/UnderstandingNPs.pdf.

[6] EZchip Technologies, "Reducing Router Chip Count, Power and Cost by 80%", http://www.ezchip.com/images/pdfs/NP-1_classification_whpaper.pdf.

[7] EZchip Technologies, "Implementing Layer 4-7 Switches using the NP-1c Network Processor", http://www.ezchip.com/images/pdfs/EZchip_L4-7_Switches.pdf.

[8] Rand, L. "Security Policy Enforcement For Networks". Netboost, http://www.itsecurity.com/papers/policyenf.htm.

[9] Chandra, V. "Selecting a network processor architecture". IBM Microelectronics, http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/EF166316618D520187256C3F005C10C1.

[10] Morris, J. "Data Structures and Algorithms – Hash Tables". Centre for Intelligent Information Processing Systems, Department of Electrical and Electronic Engineering, University of Western Australia, http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/hash_tables.html.

[11] IBM Corporation, "IBM PowerNP NP4GS3 Network Processor", http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/852569B20050FF7785256983006A3809.

[12] Crowley, P., Fiuczynski, M., Baer, J., Bershad, B. "Characterizing Processor Architectures for Programmable Network Interfaces". Department of Computer Science and Engineering, University of Washington, http://www.cs.washington.edu/homes/pcrowley/papers/ics00-final.pdf.

[13] Intel Corporation, "Intel IXP1250 Network Processor", http://www.intel.com/design/network/datashts/278371.htm.

[14] Intel Corporation, "Intel IXP2850 Network Processor", http://www.intel.com/design/network/prodbrf/252136.htm.

[15] EZchip Technologies, "Network Processor Designs for Next Generation Networking Equipment", http://www.ezchip.com/images/pdfs/ezchip_white_paper.pdf.

[16] Henriksson, T. "Review of 'C Compiler Design for a Network Processor'". Department of Electrical Engineering, Linkopings universitet, http://www.ida.liu.se/~chrke/courses/ACC/NPCompiler.pdf.

[17] Agere Systems, "The Case for a Classification Language", http://www.agere.com/enterprise_metro_access/docs/classificationwhitepaper.pdf.

[18] Chandra, P., Yavatkar, R. "Programming Network Processors – Controlling the Beast". Intel Communications Group, Intel Corporation, http://www.commdesignconference.com/db_area/cdc03/papers/P247Yavatkar.pdf.

[19] PICMG, "AdvancedTCA PICMG 3.0 Shortform Specification", http://www.picmg.org/pdf/PICMG_3_0_Shortform.pdf.

[20] Mello, B., Schapfel, F. "It's Round 2 for Network Processors". Intel Corporation, Network Processor Division, http://www.eetimes.com/in_focus/embedded_systems/OEG20040205S0018.